Statistical Language Models for Intelligent XML Retrieval

Author

  • Djoerd Hiemstra
Abstract

The XML standards that are currently emerging have a number of characteristics that can also be found in database management systems, like schemas (DTDs and XML Schema) and query languages (XPath and XQuery). Following this line of reasoning, an XML database might resemble traditional database systems. However, XML is more than a language to mark up data; it is also a language to mark up textual documents. In this chapter we specifically address XML databases for the storage of ‘document-centric’ XML (as opposed to ‘data-centric’ XML [42]). Document-centric XML is typically semi-structured, that is, it is characterised by less regular structure than data-centric XML. The documents might not strictly adhere to a DTD or schema, or possibly the DTD or schema might not have been specified at all. Furthermore, users will in general not be interested in retrieving data from document-centric XML: they will be interested in retrieving information from the database. That is, when searching for documents about “web information retrieval systems”, it is not essential that the documents of interest actually contain the words “web”, “information”, “retrieval” and “systems” (i.e., they might be called “internet search engines”). An intelligent XML retrieval system combines ‘traditional’ data retrieval (as defined by the XPath and XQuery standards) with information retrieval. Essential for information retrieval is ranking documents by their probability, or degree, of relevance to a query. On a sufficiently large data set, a query for “web information retrieval systems” will retrieve many thousands of documents that contain any, or all, of the words in the query. As users are in general not willing to examine thousands of documents, it is important that the system ranks the retrieved set of documents in such a way that the most promising documents are ranked on top, i.e., are the first to be presented to the user.
Unlike the database and XML communities, which have developed some well-accepted standards, the information retrieval community does not have

Similar resources

Language Modeling Approaches to Information Retrieval

This article surveys recent research in the area of language modeling (sometimes called statistical language modeling) approaches to information retrieval. Language modeling is a formal probabilistic retrieval framework with roots in speech recognition and natural language processing. The underlying assumption of language modeling is that human language generation is a random process; the goal ...

Prototyping a Vibrato-Aware Query-By-Humming (QBH) Music Information Retrieval System for Mobile Communication Devices: Case of Chromatic Harmonica

Background and Aim: The current research aims at prototyping query-by-humming music information retrieval systems for smart phones. Methods: This multi-method research follows simulation technique from mixed models of the operations research methodology, and the documentary research method, simultaneously. Two chromatic harmonica albums comprised the research population. To achieve the purpose ...

Comparing XML-IR Query Formation Interfaces

XML information retrieval (XML-IR) systems differ from traditional information retrieval systems by using structure of XML documents to retrieve more specific units of information than the documents themselves. Users interact with XML-IR systems via structured queries that express their content and structural requirements. Historically, it has been common belief within the XML-IR community that...

Building and Searching an XML-Based Corporate Memory

No matter who uses a corporate memory or how it is constructed, information search through that memory should be efficient and effective. In particular, it should adapt to the users’ needs, activities, and work environments. For a document-based corporate memory distributed through the Web, which is our research area, these requirements raise two main questions: How will we describe the document...

XML Retrieval Models for Legislation

Legislation contains text-rich documents and is increasingly marked with XML tags. The XML markup can among other uses be exploited to more precisely answer free information queries. In this article we report on different XML retrieval models we explicitly designed for the retrieval of legislation and which are based on the vector space model and the probabilistic language model. In addition se...


Publication date: 2003